System and method for multiple-factor selection

ABSTRACT

The disclosed subject matter provides techniques for multiple-factor selection. The factors can be features or elements that are jointly associated with one or more outcomes by their joint presence or absence. There may be a non-causative correlation between the factors, features, or elements and the outcomes. In some embodiments, Entropy Minimization and Boolean Parsimony (EMBP) is used to identify modules of genes jointly associated with disease from gene expression data, and a logic function is provided to connect the combined expression levels in each gene module with the presence of disease. The smallest module of genes whose joint expression levels can predict the presence of disease can be identified.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to U.S. Provisional Application Ser. No. 60/748,662 filed Dec. 7, 2005 and U.S. Provisional Application Ser. No. 60/754,102 filed Dec. 27, 2005, the entire contents of each of which are incorporated by reference herein.

BACKGROUND

The disclosed subject matter relates generally to techniques for factor selection, including factors useful in gene expression analysis.

The expression levels of thousands of genes, measured simultaneously using DNA microarrays, may provide information useful for medical diagnosis and prognosis. However, gene expression measurements have not provided significant insight into the development of therapeutic approaches. This can be partly attributed to the fact that while traditional gene selection techniques typically produce a “list of genes” that are correlated with disease, they do not reflect any interrelationships of the genes.

Gene selection techniques based on microarray analysis often involve individual gene ranking depending on a numerical score measuring the correlation of each gene with particular disease types. The expression levels of the highest-ranked genes tend to be either consistently higher in the presence of disease and lower in the absence of disease, or vice versa. Such genes usually have the property that their joint expression levels corresponding to diseased tissues and the joint expression levels corresponding to healthy tissues can be cleanly separated into two distinct clusters. These techniques are therefore convenient for classification purposes between disease and health, or between different disease types. However, they do not identify systems of multiple interacting genes whose joint expression state predicts disease.

There is therefore a need for an approach that identifies modules of genes that are jointly associated with disease from gene expression data. There is also a need for an approach that will provide insight into the underlying biomolecular logic by producing a logic function connecting the combined expression levels in a gene module with the presence of disease.

SUMMARY

The disclosed subject matter provides techniques for multiple-factor selection. The factors can be features or elements that are jointly associated with one or more outcomes by their joint presence or absence. There may be a non-causative correlation between the factors, features, or elements and the outcomes.

In some embodiments, Entropy Minimization and Boolean Parsimony (EMBP) is used to identify modules of genes jointly associated with disease from gene expression data, and a logic function is provided to connect the combined expression levels in each gene module with the presence of disease. The smallest module of genes whose joint expression levels can predict the presence of disease can be identified.

In accordance with an aspect of the disclosed subject matter, the simplest logic function connecting these genes to achieve this prediction can be identified. In one example, EMBP analysis can be applied on a prostate cancer dataset, and the resulting gene modules and logic functions are validated on a different dataset.

In one embodiment, the disclosed subject matter provides a method for selecting factors from a data set of measurements where the measurements include values of the factors and outcomes. Two or more factors that are jointly associated with one or more outcomes are identified from the data set, and each of the factors is analyzed to determine at least one interaction among the factors with respect to an outcome.

The two or more factors can be a module of factors, and the at least one interaction can be a structure of interactions. Preferably, the at least one interaction is a logic function.

In another embodiment, the two or more factors are two or more genes, the data can be gene expression data including expression levels, and the one or more outcomes can be presence or absence of a disease. The two or more genes can be a module of genes, such that the smallest module of genes with joint expression levels is used for a prediction of the presence or absence of disease with high accuracy. Further, the logic function can be the simplest logic function connecting the genes to achieve the prediction.

In another embodiment, a method for gene selection can be provided. The method can be used for selecting two or more genes from gene expression data, the gene expression data including expression levels for each of the two or more genes. The method includes providing gene expression data for two or more genes, where the gene expression data includes expression levels for each of the two or more genes. The method also includes discretizing the gene expression data, identifying the two or more genes with a minimal conditional entropy, and identifying an interaction that connects the expression levels in the two or more genes with presence of a disease. The gene expression data can be derived from a microarray of gene expression data, or, alternatively, the two or more genes can be a module of genes. Preferably, the interaction is the most parsimonious Boolean function.

In another embodiment, a system for selecting two or more genes from gene expression data, the gene expression data including expression levels for each of the two or more genes, is provided. The system includes at least one processor coupled to a computer readable medium, the computer readable medium storing instructions which when executed cause the processor to provide gene expression data for the two or more genes, discretize the gene expression data, choose a single threshold for each of the two or more genes, identify the two or more genes with a minimal conditional entropy, and identify an interaction that connects the expression levels in the two or more genes with presence of a disease. In this embodiment, the gene expression data includes expression levels for each of the two or more genes.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustrative drawing of entropy minimization.

FIG. 2 is an illustrative drawing of boolean parsimony.

FIG. 3 is an illustrative drawing of the disclosed subject matter.

FIGS. 4(A)-4(C) are illustrative drawings showing one example of an estimation of a Boolean function for a gene module.

FIG. 5(A) and FIG. 5(B) are graphs representing minimum entropy across different thresholds.

FIGS. 6(A)-6(E) are Karnaugh maps leading to Boolean functions across different thresholds.

DETAILED DESCRIPTION

According to one aspect of the disclosed subject matter, a method for multiple factor selection is provided. The method includes identifying factors jointly associated with an outcome from a data set for a plurality of factors, and analyzing each of the plurality of modules to determine a structure of interactions among the factors with respect to the outcome. The data set can be a set of measurements that includes values of the factors and the outcome.

One important application of the disclosed subject matter, which will be described in detail below, is the inference of disease-related molecular logic from a systems-based microarray of gene data. Although the following will be described for disease data, it can be more generally applicable to other data sets. Other applicable data sets include other biological data such as how cells are influenced by stimuli jointly, financial data, Internet traffic data, scheduling data for industries, marketing data, and manufacturing data, for example. Table 1, below, specifies other data sets relevant to various objectives, including factors and outcomes, to which the disclosed subject matter can also be applied:

TABLE 1

Objective | Factors | Outcomes
Disease pathway identification | Gene expression | Disease
Synaptic specificity factors | Gene expression | Neural synapses
Disease susceptibility of specific genotypes | Single Nucleotide Polymorphisms (SNPs) | Disease
Genotypic basis for gene expression profiles | SNPs | Gene expression
Gene regulation factors | Gene expression | Gene expression
Gene expression association with individual SNPs | Gene expression | SNP
Pharmacogenomics | SNPs | Drug resistance
Drug side-effect modeling | SNPs | Side effect of drug
Stocks/bonds/currency selling/buying price identification | Stocks/bonds/currency price time series data | Sell/Buy at given price
Macroeconomic models | Macroeconomic time series such as consumer index, housing market index, trade balance, etc. | Federal interest rate increase/decrease

In accordance with the disclosed subject matter, microarray expression data is discretized. For example, the data can be binarized into two levels. Although the EMBP methodology can be generalized to account for multiple expression levels, the binarization of expression data simplifies the presentation of the concepts and provides simple logical functions connecting the genes within the found modules. Other levels of discretization, such as trinarization, can also be used.

Rather than independently binarizing each gene's expression level, which would be more appropriate for an individual gene ranking approach, a single threshold is used for all genes. This approach is consistent with the fact that finding global interrelationships among genes is desirable for researchers and that the microarray data have already been normalized across the tissues and genes. Therefore, a choice of a high threshold will identify the genes that are “strongly” expressed, while a choice of a low threshold will identify the genes that are expressed even “weakly.” EMBP analysis can be performed across several thresholds to determine the threshold levels that provide optimized performance, as described below.

Following binarization, each gene can be assumed to be either expressed or not expressed in a particular tissue. It can also be assumed that there are two types of tissues, either healthy ones or tissues suffering from a particular disease. The latter assumption can also be generalized to include more than two types of tissues, or modified to be used for classification among several types of cancer. Thus, given M genes and K tissues, an M×K binary “expression matrix” E can be defined so that E(i,j) is 1 if gene i is expressed in tissue j, and 0 otherwise. Furthermore, a K-vector c can be defined so that c(j) is 1 if tissue j is diseased and 0 if it is healthy.

For each gene module of size n there are 2ⁿ possible gene expression states. For each state S, the number N₀(S) of times that the state appears in a healthy tissue can be counted, and the number N₁(S) of times that it appears in a diseased tissue can also be counted. A table can be created with 2ⁿ rows corresponding to the gene expression states (a “state-count table”), in which each row contains the two counts N₀ and N₁ for the corresponding state. Table 2 illustrates two examples of such state-count tables for n=4.

TABLE 2

a b c d   N₀  N₁   |   a b c d   N₀  N₁
0 0 0 0    0   0   |   0 0 0 0    0  14
0 0 0 1    0   0   |   0 0 0 1    5   0
0 0 1 0    0   0   |   0 0 1 0    8   0
0 0 1 1    0   0   |   0 0 1 1    2   0
0 1 0 0    0   0   |   0 1 0 0    1  19
0 1 0 1    1   0   |   0 1 0 1    0  13
0 1 1 0   12  21   |   0 1 1 0    0   2
0 1 1 1   10  10   |   0 1 1 1    0   1
1 0 0 0    0   0   |   1 0 0 0    4   0
1 0 0 1    2   1   |   1 0 0 1   12   0
1 0 1 0    0   0   |   1 0 1 0    4   0
1 0 1 1    0   0   |   1 0 1 1    6   0
1 1 0 0    0   0   |   1 1 0 0    1   3
1 1 0 1    8   3   |   1 1 0 1    2   0
1 1 1 0    2   1   |   1 1 1 0    0   0
1 1 1 1   15  16   |   1 1 1 1    5   0
H = 0.951          |   H = 0.088
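The binarization and state counting described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `binarize` and `state_count_table` are hypothetical, the toy expression values are invented, and treating a gene as expressed when its level is strictly greater than the threshold is an assumption, since the comparison direction at the boundary is not specified here.

```python
from itertools import product


def binarize(expression, threshold):
    """Return a 0/1 matrix; a gene counts as expressed when its level
    is strictly above the threshold (assumed convention)."""
    return [[1 if value > threshold else 0 for value in row]
            for row in expression]


def state_count_table(E, c, module):
    """Build the state-count table for one gene module.

    E      : M x K binary expression matrix (rows are genes, columns are tissues)
    c      : K labels, 1 for diseased tissue and 0 for healthy tissue
    module : row indices of the n genes in the module
    Returns a dict mapping each of the 2**n states to [N0, N1].
    """
    table = {state: [0, 0] for state in product((0, 1), repeat=len(module))}
    for j, label in enumerate(c):
        state = tuple(E[i][j] for i in module)
        table[state][label] += 1
    return table


# Hypothetical toy data: 2 genes measured in 3 tissues.
E = binarize([[10, 50, 70], [5, 80, 20]], threshold=30)
for state, (n0, n1) in state_count_table(E, [0, 1, 1], [0, 1]).items():
    print(state, n0, n1)
```

Keeping every one of the 2ⁿ states in the table, including those never observed, mirrors the layout of Table 2, where unobserved states appear with both counts equal to zero.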

Referring to FIG. 1, a method for identifying the set of n genes whose combined expression levels predict the presence or absence of disease with minimum uncertainty will be described.

This can be referred to as “entropy minimization,” because the uncertainty can be quantified with the information theoretic measure known as conditional entropy. A probabilistic model can be created, in which probabilities are equal to relative frequencies derived from the counts N₀(S) and N₁(S), so that the presence of disease and the gene expression states are random variables.

Specifically, the probability of encountering expression state S in a tissue chosen at random can be defined as P(S) according to Equation (1):

$\begin{matrix}{{P(S)} = {\frac{{N_{0}(S)} + {N_{1}(S)}}{K}\text{:}\mspace{14mu} \begin{matrix}{{Probability}\mspace{14mu} {of}\mspace{14mu} {encountering}\mspace{14mu} {expression}} \\{{state}\mspace{14mu} S\mspace{14mu} {in}\mspace{14mu} a\mspace{14mu} {tissue}\mspace{14mu} {chosen}\mspace{14mu} {at}\mspace{14mu} {random}}\end{matrix}}} & (1)\end{matrix}$

Additionally, the probability of disease in the tissue given that its expression state is S can be defined as Q(S) according to Equation (2):

$\begin{matrix}{{Q(S)} = {\frac{N_{1}(S)}{{N_{0}(S)} + {N_{1}(S)}}\text{:}\mspace{14mu} \begin{matrix}{{Probability}\mspace{14mu} {of}\mspace{14mu} {disease}\mspace{14mu} {in}\mspace{14mu} a\mspace{14mu} {tissue}} \\{{given}\mspace{14mu} {that}\mspace{14mu} {is}\mspace{14mu} {expressing}\mspace{14mu} {state}\mspace{14mu} {is}\mspace{14mu} S}\end{matrix}}} & (2)\end{matrix}$

If the expression state S for a particular tissue is known, then the uncertainty of determining whether or not disease exists in that tissue can be measured by the entropy H(Q(S)), where the function H can be defined by Equation (3):

H(q) = −q log₂(q) − (1−q) log₂(1−q)  (3)

The average overall uncertainty of determining whether or not disease is present can then be measured by the “conditional entropy” of the presence of disease given the expression state for the gene set, defined by Equation (4):

ΣP(S)H(Q(S))  (4)

The summation is over the 2ⁿ states S with P(S)>0. Finally, to ensure that the range of possible values for the conditional entropy extends from 0 to 1, normalization can be performed by dividing by H(Q_(null)), the entropy corresponding to the probability of disease in a randomly chosen tissue. In the case of the prostate data that is used, this probability is equal to 52/102. For simplicity, in the specification, the normalized conditional entropy is often referred to as just “entropy.”

The conditional entropy, as defined above, depends on the counts N₀ and N₁ for the 2ⁿ states. Its interpretation as a measure of uncertainty is illustrated in the example of Table 2, which contains two state-count tables created using the binarized expression matrix for the prostate tissues used herein, and a threshold of 15. The state-count table on the left corresponds to a choice of four genes a, b, c, d, selected at random, which in this case happen to have accession numbers a: AF038193, b: M22632, c: X87949, d: A1926989. The resulting value of the normalized conditional entropy of 0.951 is typical for random choices of gene sets. On the other hand, the state-count table on the right corresponds to the gene set for which the minimum normalized conditional entropy of 0.088 was found, consisting of genes a: COL4A6, b: CYP1B1, c: SERPINB5, and d: GSTP1, to be discussed below. In this latter gene set choice, as shown in Table 2, the reduced entropy is manifested by the fact that the statistics are skewed for nearly all states. For example, all 13 tissues corresponding to state 0101 are cancerous, and all 12 tissues corresponding to state 1001 are healthy.
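As a check on Equations (1) through (4) and the normalization by H(Q_(null)), the following Python sketch recomputes the normalized conditional entropy from the two state-count tables of Table 2. The function names are hypothetical, and the count lists are transcribed from Table 2 in row order 0000 through 1111.

```python
import math


def binary_entropy(q):
    """H(q) of Equation (3), with H(0) = H(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def normalized_conditional_entropy(counts):
    """counts: one (N0, N1) pair per expression state."""
    observed = [(n0, n1) for n0, n1 in counts if n0 + n1 > 0]  # P(S) > 0
    K = sum(n0 + n1 for n0, n1 in observed)
    diseased = sum(n1 for _, n1 in observed)
    conditional = sum((n0 + n1) / K * binary_entropy(n1 / (n0 + n1))
                      for n0, n1 in observed)
    return conditional / binary_entropy(diseased / K)


# (N0, N1) per state, transcribed from Table 2.
LEFT_COUNTS = [(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (1, 0), (12, 21), (10, 10),
               (0, 0), (2, 1), (0, 0), (0, 0), (0, 0), (8, 3), (2, 1), (15, 16)]
RIGHT_COUNTS = [(0, 14), (5, 0), (8, 0), (2, 0), (1, 19), (0, 13), (0, 2), (0, 1),
                (4, 0), (12, 0), (4, 0), (6, 0), (1, 3), (2, 0), (0, 0), (5, 0)]

print(round(normalized_conditional_entropy(LEFT_COUNTS), 3))   # 0.951
print(round(normalized_conditional_entropy(RIGHT_COUNTS), 3))  # 0.088
```

Both tables sum to 50 healthy and 52 cancerous tissues, so the normalizing term is H(52/102) in each case. The sketch would divide by zero if every tissue had the same label, a case that does not arise with this data.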

Entropy minimization can be directed to identifying the gene set with the minimum conditional entropy, as defined above, among the subsets of size n of the full set of M genes. The number of these subsets is equal to:

$\begin{matrix}\begin{pmatrix}M \\n\end{pmatrix} & (5)\end{matrix}$

and becomes large for n≧3, making the exhaustive search method impractical. As explained below, however, this can be addressed using heuristic search optimization methods 104.
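To give a sense of scale, the short Python sketch below evaluates expression (5) for M = 12,600, the number of probed genes in the prostate dataset described later in this specification. The helper name `num_subsets` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math


def num_subsets(M, n):
    """Number of gene sets of size n drawn from M genes, as in expression (5)."""
    return math.comb(M, n)


M = 12600  # probed genes in the EMBP dataset described below
for n in (1, 2, 3, 4):
    print(n, num_subsets(M, n))
```

For n=3 the count is already above 3×10¹¹, which is why a heuristic search is used in place of exhaustive enumeration.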

A combination of two heuristic optimization techniques can be used to search for minimum entropy gene sets, allowing sufficient time for each of them to converge. The first technique is depicted in FIG. 1. Initially, the binarized gene expression and disease data is gathered 101. Then, starting from a randomly chosen gene set of size n 102, for each part of the iteration, the “current” gene set is modified by replacing one of its genes, chosen at random, with a new gene, also chosen at random from the entire gene set M 109. If the conditional entropy 103, 105 of the new gene set is lower than that of the current gene set, then the new gene set replaces the current gene set. The process terminates when the conditional entropy is zero 108, 110, or when the current gene set remains unmodified for a large number of times 106.

Traditional optimization methodology is used. The number of iterations can range from 10 to 100 or more. To avoid selecting a local minimum, the same iterative algorithm can be repeated several times starting from different initial conditions of the same size, and the gene set that yields the overall lowest conditional entropy can be selected. The size of the gene set can then be increased 109 to n+1, and the whole process repeated, making sure that one of the chosen initial conditions contains the previously found gene set. This technique converges to some choice of near-optimum results.
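The first search technique can be sketched in Python as below. This is a minimal illustration, not the implementation of the disclosed subject matter: the names `module_entropy` and `greedy_search`, the `max_stale` stopping parameter, the seeding scheme, and the toy data are all assumptions introduced for the example.

```python
import math
import random
from collections import defaultdict


def binary_entropy(q):
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def module_entropy(E, c, module):
    """Normalized conditional entropy of the labels c given the module's state."""
    counts = defaultdict(lambda: [0, 0])
    for j, label in enumerate(c):
        counts[tuple(E[i][j] for i in module)][label] += 1
    K = len(c)
    h = sum((n0 + n1) / K * binary_entropy(n1 / (n0 + n1))
            for n0, n1 in counts.values())
    return h / binary_entropy(sum(c) / K)


def greedy_search(E, c, n, max_stale=200, seed=0):
    """Replace one randomly chosen gene at a time; keep the change only if
    the entropy drops. Stop at zero entropy or after max_stale unmodified steps."""
    rng = random.Random(seed)
    M = len(E)
    current = rng.sample(range(M), n)
    best = module_entropy(E, c, current)
    stale = 0
    while best > 0.0 and stale < max_stale:
        candidate = list(current)
        candidate[rng.randrange(n)] = rng.randrange(M)
        if len(set(candidate)) < n:  # duplicate gene drawn: set left unmodified
            stale += 1
            continue
        h = module_entropy(E, c, candidate)
        if h < best:
            current, best, stale = candidate, h, 0
        else:
            stale += 1
    return sorted(current), best


# Hypothetical toy data: 4 genes, 8 tissues; gene 2 matches the labels exactly.
c = [0, 0, 0, 0, 1, 1, 1, 1]
E = [[0, 1, 0, 1, 0, 1, 0, 1],
     [0, 0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 0, 1, 1, 1, 1],
     [1, 0, 0, 0, 1, 1, 0, 1]]
print(greedy_search(E, c, n=1))
```

Restarts from different initial conditions, as described above, would amount to calling `greedy_search` with several seeds and keeping the result with the lowest entropy.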

In order to reduce the chance that the found solution corresponds to a local, rather than a global minimum, a Simulated Annealing (SA) approach can be used to search in the space of the subsets of size n. In an “annealing” process, a melt, initially disordered at high “temperatures,” can be slowly cooled. As cooling proceeds, the system becomes more ordered and approaches a “frozen” state when the temperature equals zero.

As applied to the gene data set, at high temperatures the algorithm modifies the “current” gene set of size n by replacing k genes, chosen at random, with k new genes (k<n). If the new gene set has lower entropy compared to the current gene set, it is chosen to replace the current gene set. However, with a small probability also proportional to the current temperature, small increases in the entropy are allowed. As the temperature falls over time, the value of k keeps dropping, effectively searching in the local neighborhood of the gene set. The SA algorithm terminates when the temperature reaches zero or when the current gene set remains unmodified for a large number of iterations.

Changing multiple genes at once at high temperatures allows the algorithm to search for solutions across large regions of the state space. However, as the temperature falls, the algorithm searches in increasingly localized regions, thus allowing for “fine tuning” of the solution.
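A rough Python sketch of such an annealing loop is given below. Everything specific in it is an assumption made for illustration: the linear cooling schedule, the rule tying k to the temperature, and the parameters `p_max`, `max_increase`, and `max_stale` are not taken from this specification. The objective is passed in as a callable so that any entropy function over gene index lists can be plugged in. The sketch assumes n≥2 and M≥2n−1.

```python
import random


def simulated_annealing(objective, M, n, steps=2000, p_max=0.1,
                        max_increase=0.05, max_stale=500, seed=0):
    """Minimize objective(gene_index_list) over subsets of size n of range(M)."""
    rng = random.Random(seed)
    current = rng.sample(range(M), n)
    cur_val = objective(current)
    best, best_val = list(current), cur_val
    stale = 0
    for step in range(steps):
        temp = 1.0 - step / steps            # falls linearly from 1 toward 0
        k = max(1, round(temp * (n - 1)))    # genes swapped per move, k < n
        candidate = list(current)
        positions = rng.sample(range(n), k)
        pool = [g for g in range(M) if g not in current]
        for pos, gene in zip(positions, rng.sample(pool, k)):
            candidate[pos] = gene
        val = objective(candidate)
        delta = val - cur_val
        # Accept improvements; accept small increases with a probability
        # proportional to the current temperature.
        if delta < 0 or (delta <= max_increase and rng.random() < p_max * temp):
            current, cur_val, stale = candidate, val, 0
            if val < best_val:
                best, best_val = list(candidate), val
        else:
            stale += 1
            if stale >= max_stale:           # gene set unmodified for too long
                break
    return sorted(best), best_val


# Hypothetical stand-in objective: the sum of the gene indices.
print(simulated_annealing(lambda genes: sum(genes), M=6, n=2))
```

In an actual run the objective would be the normalized conditional entropy of the candidate module, and the schedule parameters would need tuning to the size of the dataset.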

The entropy minimization techniques can be implemented by way of off-the-shelf software such as MATLAB, JAVA, C++, or any other software. Machine language or other low level languages can also be utilized. Multiple processors working in parallel can also be utilized.

Once the minimum entropy gene set is identified, the state-count table is extracted 201, as shown in FIG. 2. If the conditional entropy for a particular gene set is found to be exactly zero, this implies that the joint expression levels of the members of that gene module determine the existence of disease with absolute certainty under the assumption of the probabilistic model derived from the relative frequencies. This happens whenever, for each of the 2ⁿ states in the corresponding state-count table, at least one of the counts N₀ and N₁ is zero.

In practice, when this occurs, a large number of states are encountered only occasionally. A reliable association of disease based on these rarely encountered states cannot be made, and including them in the model will result in “overfitting,” so they are treated as noise and are ignored, in favor of the states which predominantly correspond to disease. In accordance with the disclosed subject matter, the definition for such states can be that they have been encountered at least three times and that the number of corresponding cancerous tissues can be at least four times larger than the count of corresponding healthy tissues, i.e., N₁≧3 and N₁≧4N₀. Note that other definitions can be used, as needed. Logic functions or relationships are useful because they can lead to biological discovery when understood in the context of additional biological knowledge. Therefore, whenever the entropy for a gene set is found to be exactly zero, the size of the gene set can be decreased by one, and the minimum-entropy gene set of that size can be selected. Thus, the output of the EMBP analysis contains a gene module for which the conditional entropy can be close, but not equal, to zero.

Once a gene module has been identified, and the expression states 202 for that module, as shown in FIG. 2, that are predominantly associated with cancer have been determined as described above, the following can then be addressed: given the gene expression states associated with disease, find the simplest logical rule that connects the expression levels in the gene module with the presence of disease, as depicted in FIG. 2.

This can be referred to as “Boolean parsimony”, because the logical rule will be identified by the “most parsimonious Boolean function.” The definition for this logic function is one containing the operators AND, OR and NOT, which minimizes a “cost,” defined as the total number of logic variables appearing in the expression 203. The most parsimonious function is one which is essentially the least complex because it “costs” the least 205.

In Boolean algebra, each logic variable can take the value of either 0 (false) or 1 (true), the operator AND corresponds to multiplication, and the operator OR corresponds to addition. The symbol of prime (′) following the logic variable designates the operator NOT. For example, ab+a′b′+ab′ means (a AND b) OR [(NOT a) AND (NOT b)] OR [a AND (NOT b)], and the “cost” (as defined above) of this Boolean function is 6, because each of the variables a and b appears three times 203. This Boolean expression happens to be logically equivalent to a+b′, meaning: a OR (NOT b). The latter expression is more parsimonious than the former, because its “cost” is equal to 2, as each of the letters a and b appears once. The reason for using Boolean parsimony is that the biological role of each gene becomes more immediately clear if the Boolean expression contains the corresponding logic variable either once or a few times. The above definition of Boolean parsimony was selected because the logic functions AND, OR and NOT often have straightforward potential biological interpretations. The issue can be easily resolved manually when the size of the gene set is less than five, using Karnaugh map logic design methodology (as shown in FIG. 6). Otherwise, Boolean minimization programs such as Espresso can be used 204. Most of them retain the “sum of products” structure of the Boolean expression, but further minimization can be accomplished using heuristic algorithms.
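The equivalence and the two costs in the example above can be verified by brute force over the truth table, as in the following Python sketch. The helper names `cost` and `equivalent` are hypothetical, and the cost function simply counts variable letters in a text form of the expression that uses an ASCII apostrophe for NOT.

```python
from itertools import product


def cost(expression, variables="ab"):
    """Total number of logic variable occurrences in the expression text."""
    return sum(expression.count(v) for v in variables)


def equivalent(f, g, n_vars):
    """True when f and g agree on every row of the truth table."""
    return all(f(*bits) == g(*bits)
               for bits in product((False, True), repeat=n_vars))


def long_form(a, b):
    # ab + a'b' + ab'
    return (a and b) or ((not a) and (not b)) or (a and (not b))


def short_form(a, b):
    # a + b'
    return a or (not b)


print(equivalent(long_form, short_form, 2))   # True
print(cost("ab+a'b'+ab'"), cost("a+b'"))      # 6 2
```

Exhaustive comparison is only practical for small modules, since the table has 2ⁿ rows; this matches the scale at which Karnaugh maps are used manually.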

As illustrated in the embodiment depicted in FIG. 3, a system in accordance with the disclosed subject matter can include a processor or multiple processors 304 and a computer readable medium 301 coupled to the processor or processors 304. The computer readable medium includes data such as factors and outcomes 302 and can also include programs for entropy minimization and Boolean parsimony 303. The system leads to a parsimonious Boolean expression of factors that relate to outcomes 305. Multiple processors 304 working in parallel can also be utilized.

In one embodiment of the disclosed subject matter, two different prostate cancer datasets are used. The first prostate cancer microarray expression data contains gene expression profiles for 102 prostate tissues, of which 52 are cancerous and 50 are healthy, available at http://www-genome.wi.mit.edu/MPR/prostate. The gene expression profiles in scaled average difference units are produced using HG-U95A Affymetrix microarrays with probes for 12,600 genes. This dataset will henceforth be referred to as the “EMBP dataset,” because it can be used to apply EMBP analysis. A second independently derived dataset, also containing scaled average difference units, referring to 34 tissue samples, out of which 25 are cancerous and 9 are healthy, can also be obtained at http://www.gnf.org/cancer/prostate and used for validating the gene modules and logic functions estimated over the EMBP dataset. This latter dataset will be referred to as the “validation dataset.”

The continuous-valued data of the EMBP dataset were binarized using several thresholds ranging from −30 to 225 in increments of 15, and the minimum entropy gene modules were estimated for each threshold. For each threshold value, gene modules of size n=1, 2, 3 and 4 were considered. For n=4, the entropy values occasionally went down to precisely zero due to overfitting. On the other hand, several gene sets of size 3 were found with entropy values less than 0.20, which is low: note that H(0.97)=0.20, meaning that if the conditional entropy is 0.20 then, on the average, each state is associated with either cancer or health with probability 97%. Therefore, n=3 was selected to be the number of genes included in these gene modules.

The results are shown in FIG. 5A. The thresholds for which the minimum entropy values were below 0.20 for n=3 are 30, 45, 60, 75 and 90. The minimum entropy gene modules for these thresholds along with their entropies are listed in Table 3, below.

TABLE 3

Threshold | Gene Module a | Gene Module b | Gene Module c | Entropy
30 | SPINK2 | TMSL8 | RBP1 | 0.19155
45 | HPN | ENTPD1 | NELL2 | 0.14302
60 | NCF4 | HPN | PGM1 | 0.11587
75 | HPN | MCM3AP | GSTP1 | 0.16287
90 | HLA-DQB1 | FNBP1 | DF | 0.13267

In the following example, official gene symbols are used, and Table 4, below, contains a legend with the corresponding accession numbers, aliases and brief gene descriptions.

TABLE 4

Symbol | Accession | Alias/Description
COL4A6 | D21337 | Collagen, type IV, alpha 6
CYP1B1 | U03688 | Cytochrome P450, family 1, subfamily B, polypeptide 1
DF | M84526 | Adipsin, D component of complement
ENTPD1 | AJ133133 | Ectonucleoside triphosphate diphosphohydrolase 1
FNBP1 | AB011126 | KIAA0554, Formin binding protein 1
GSTP1 | U12472 | Glutathione S-transferase pi
HLA-DQB1 | M81141 | Major histocompatibility complex, class II, DQ beta 1
HPN | X07732 | Hepsin, transmembrane protease, serine 1
HIST1H1E | M60748 | H1F4, Histone 1, H1e
KRT6E | L42611 | Keratin 6E
MCM3AP | AB011144 | KIAA0572, MCM3 minichromosome maintenance deficient 3 (S. cerevisiae) associated protein
NCF4 | AL008637 | P40PHOX, neutrophil cytosolic factor 4 (derived from precise chip probe)
NELL2 | D83018 | NEL-like 2 (chicken) protein
PGM1 | M83088 | Phosphoglucomutase 1
RBP1 | M11433 | Cellular retinol binding protein 1
SERPINB5 | U04313 | Maspin, Serpin peptidase inhibitor, clade B (ovalbumin), member 5
SPINK2 | X57655 | Serine peptidase inhibitor, Kazal type 2 (acrosin-trypsin inhibitor)
TMSL8 | D82345 | TMSNB, Thymosin-like 8

To evaluate the significance of these minimum entropy values, entropy minimization can be performed over ten random permutations of the tissue class labels. In other words, while keeping the number of healthy and cancerous tissues constant at 50 and 52 respectively, healthy (0) and cancerous (1) labels are randomly assigned to the individual tissue profiles. The entropy minimization algorithm can be performed on the randomly permuted data and the average minimum entropies for n=3 can be estimated for the thresholds 30, 45, 60, 75 and 90 for the same expression matrix of the EMBP dataset. The estimated averages of the entropies are shown in the heavy black line in FIG. 5B. Notably, the entropy values for the randomly permuted data for n=3 are much higher than those estimated on the actual dataset, and even significantly higher than the entropy values of the actual data with n=2, indicating that the gene modules identified by entropy minimization on the actual data have real biological meaning, rather than being due to chance.
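The label-permutation check can be sketched in Python as follows. The function name `permutation_baseline` and its parameters are hypothetical, and the entropy minimization itself is passed in as a callable that is assumed to take a list of labels and return the minimum entropy found; the stand-in used at the end is only there to make the sketch runnable.

```python
import random


def permutation_baseline(min_entropy_fn, labels, n_permutations=10, seed=0):
    """Average the minimum entropy found over random shuffles of the labels.

    Shuffling keeps the numbers of healthy and cancerous tissues fixed and
    only changes which tissue profile carries which label.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        results.append(min_entropy_fn(shuffled))
    return sum(results) / len(results)


# 50 healthy (0) and 52 cancerous (1) tissues, as in the EMBP dataset.
labels = [0] * 50 + [1] * 52

# Stand-in for a real entropy minimization run: returns the diseased fraction.
print(permutation_baseline(lambda shuffled: sum(shuffled) / len(shuffled), labels))
```

In a real run, `min_entropy_fn` would wrap an entropy minimization search over the fixed expression matrix with the shuffled labels substituted for the true ones.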

The most parsimonious Boolean functions for these five gene modules can then be estimated. FIG. 6 contains the Karnaugh maps from which the functions are derived, together with the corresponding Boolean functions and their accuracy if these simple functions are used for classification on the “EMBP dataset.” For convenience, these Boolean functions are also formulated in words in FIG. 6, where “presence” and “absence” of a gene refer, for simplicity, to the presence or absence of mRNA from the gene. Furthermore, it was found that gene ENTPD1 in FIG. 6B can be replaced by gene HIST1H1E, and that gene NCF4 in FIG. 6C can be replaced by gene KRT6E. In both cases, these substitutions yield identical results.

The genes mentioned in FIG. 6 should not be seen as individual “prostate cancer-related genes,” which, in traditional approaches, are found to be either consistently overexpressed or consistently underexpressed in cancer. Instead, each of the identified genes can be seen as a member of a synergistic gene module, as evidenced by the formulation of the corresponding Boolean function. To further clarify the fundamental difference between the two approaches, the following observations are provided for each of the five identified gene modules, derived from simple observation of the counts in each Karnaugh map, each of which can provide hints for its biological explanation:

(a) Absence of RBP1, if accompanied by either presence of TMSL8 or absence of SPINK2, is associated with cancer in 50 out of 53 such tissues. However, in the simultaneous absence of TMSL8 and presence of SPINK2, absence of RBP1 is not associated with cancer. On the contrary, all 11 such tissues are healthy.

(b) Presence of NELL2 is associated with health in 37 out of 39 such tissues, even if HPN (normally associated with cancer) is present. Simultaneous presence of HPN and NELL2 is associated with health in 17 out of 19 such tissues.

(c) Presence of NCF4 is associated with health in all 11 such tissues, even if HPN (normally associated with cancer) is present: simultaneous presence of HPN and NCF4 is associated with health in all 9 such tissues. The same formulation is true if NCF4 is replaced by KRT6E.

(d) If either HPN is present or MCM3AP is absent, then the absence of GSTP1 is associated with cancer, as all 39 such tissues are cancerous. However, if HPN is absent and MCM3AP is present, then the absence of GSTP1 is not associated with cancer, as all 9 such tissues are healthy.

(e) In the absence of HLA-DQB1, absence of DF is associated with cancer in 48 tissues out of 50, and presence of DF is associated with health in 36 tissues out of 38. However, in the presence of HLA-DQB1, absence of DF is instead associated with health, as all 11 such tissues are healthy.

While the classification performance of the Boolean functions from EMBP analysis is high over the dataset upon which the results were derived (FIG. 6), it is important to validate these results over previously unseen gene expression profiles. The five gene modules of FIG. 6 are tested using their corresponding Boolean functions on the “validation dataset”. For that task, the expression levels of the validation dataset are binarized.

A simple transformation of the form y=ax+b was used to map the EMBP dataset thresholds to the validation dataset thresholds, where x represents thresholds over the EMBP dataset and y represents thresholds over the validation dataset. To estimate the coefficients a and b, the gene expression values are averaged over all tissues for the 12,600 genes common to both datasets. Thus, two vectors x and y of length 12,600 are obtained, whose elements are the mean gene expression levels across the tissues belonging to the EMBP and validation datasets, respectively. These two vectors are used to calculate the least squares estimate for the coefficients a and b. The mean value and the 95% confidence bound for the two coefficients were found to be: a=8.25+/−0.088, b=92.12+/−9.06. The thresholds are transformed using several values of a and b within the 95% confidence bounds, and the values yielding the highest found classification performance are selected, which were a=8.338 and b=92.12. Table 5, below, summarizes the results for each of the five Boolean functions outlined in FIG. 6 over the validation dataset.

TABLE 5

EMBP dataset   Gene Module                      Boolean       Validation dataset   Classification   Specificity   Sensitivity
Threshold      a          b         c           Function      Threshold            Accuracy (%)     (%)           (%)
30             SPINK2     TMSL8     RBP1        bc′           342.26               85.29            77.78         88
45             HPN        ENTPD1    NELL2       ab′c′         467.33               94.12            100           92
60             NCF4       HPN       PGM1        a′bc′         592.40               94.12            100           92
75             HPN        MCM3AP    GSTP1       a(c′ + b′)    717.47               97.06            100           96
90             HLA-DQB1   FNBP1     DF          a′b′c′        842.54               85.29            77.78         88

Remarkably, the classification accuracy of the simple three-gene (two-gene in one case) Boolean functions in the validation dataset was consistently high, exceeding 90% in most cases, indicating that EMBP analysis accurately extracted universally valid prostate cancer-related features.
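As a non-limiting sketch, the validation step can be expressed in Python as shown below. The Boolean functions and the threshold 717.47 are those listed in Table 5. The helper names binarize and classification_metrics, the convention that a level at or above the threshold counts as "present", and the four example tissues are assumptions made purely for illustration and are not data from either dataset.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Boolean functions as listed in Table 5. a, b, c are the binarized states
# (True = present) of the three genes of each module, in table order.
BOOLEAN_FUNCTIONS: Dict[Tuple[str, str, str], Callable[[bool, bool, bool], bool]] = {
    ("SPINK2", "TMSL8", "RBP1"): lambda a, b, c: b and not c,              # bc'
    ("HPN", "ENTPD1", "NELL2"): lambda a, b, c: a and not b and not c,     # ab'c'
    ("NCF4", "HPN", "PGM1"): lambda a, b, c: (not a) and b and not c,      # a'bc'
    ("HPN", "MCM3AP", "GSTP1"): lambda a, b, c: a and (not c or not b),    # a(c' + b')
    ("HLA-DQB1", "FNBP1", "DF"): lambda a, b, c: not (a or b or c),        # a'b'c'
}


def binarize(level: float, threshold: float) -> bool:
    """Assumed convention: a level at or above the threshold is 'present'."""
    return level >= threshold


def classification_metrics(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float, float]:
    """Return (accuracy, specificity, sensitivity) in percent."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, t in pairs if p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one cancerous and one healthy tissue")
    accuracy = 100.0 * (tp + tn) / len(pairs)
    specificity = 100.0 * tn / (tn + fp)
    sensitivity = 100.0 * tp / (tp + fn)
    return accuracy, specificity, sensitivity


# Hypothetical expression levels (HPN, MCM3AP, GSTP1) and cancer labels.
THRESHOLD = 717.47
MODULE = ("HPN", "MCM3AP", "GSTP1")
TISSUES: List[Tuple[Tuple[float, float, float], bool]] = [
    ((900.0, 100.0, 100.0), True),
    ((900.0, 900.0, 900.0), False),
    ((100.0, 100.0, 100.0), False),
    ((100.0, 900.0, 100.0), True),   # cancerous, but the rule predicts healthy
]

rule = BOOLEAN_FUNCTIONS[MODULE]
predicted = [rule(*(binarize(v, THRESHOLD) for v in levels)) for levels, _ in TISSUES]
actual = [is_cancer for _, is_cancer in TISSUES]
accuracy, specificity, sensitivity = classification_metrics(predicted, actual)
print(accuracy, specificity, sensitivity)
```

Sensitivity here is the fraction of cancerous tissues for which the Boolean function evaluates to true, and specificity is the fraction of healthy tissues for which it evaluates to false.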

The genes in the modules resulting from EMBP analysis are not co-regulated, because, if they were, then each of them alone would provide much of the information that all of them provide, and a different gene would therefore be a more appropriate partner, as it would provide complementary information. Nevertheless, these genes are typically related by a shared common “theme,” in which they are playing synergistic roles. For example, two genes can appear because they are both required for the activation of a particular cancer-causing pathway. The cause-and-effect relationship connecting disease and the presence of particular genes in a gene module is not clear from the results of quantitative analysis alone, and the Boolean functions can be seen as approximations when they are based on a relatively small set of input data, as in this case.

Coupled with additional biological knowledge, however, the results of EMBP analysis can help infer disease-related pathways, which, in turn, can help develop therapeutic interventions. This methodology uses the clues provided by the results to create assumptions involving additional genes. Assuming that each gene module has a “story” to tell, the combination of these “stories” into an integrated scenario combining many genes can be attempted. In accordance with an aspect of the disclosed subject matter, two examples of this methodology are presented below.

The first focus is on the three-gene module with the lowest overall conditional entropy (0.1159) (FIG. 6C), consisting of genes {HPN, NCF4, PGM1}. Hepsin (HPN) is a serine protease that is overexpressed in most prostate cancers. Recent evidence indicates that hepsin converts single-chain prohepatocyte growth factor into biologically active two-chain hepatocyte growth factor. The hepatocyte growth factor (HGF) is a ligand for Met, a known proto-oncogene receptor tyrosine kinase, suggesting that this functional link between hepsin and the HGF/Met pathway can be related to tumor progression. Furthermore, HGF protects cells against oxidative stress-induced apoptosis. These results suggest that hepsin may promote tumor progression by inhibiting the apoptotic mechanisms that are normally activated in cells after they become cancerous as a result of damage caused by oxidative stress.

Interestingly, both of the other members of the module (NCF4 and PGM1) have also been related to oxidative stress, strengthening the above hypothesis. Phosphoglucomutase is inhibited under oxidative stress. The absence of PGM1 (as in the Boolean function of FIG. 6C) could therefore result from oxidative stress. On the other hand, NCF4, also known as P40PHOX, is known to downregulate, under some conditions, the NADPH-oxidase, a phagocyte enzyme system that creates a superoxide-producing “oxidative burst” in response to invasive microorganisms. In this case, local oxidative stress would result from the reduced levels of P40PHOX activity.

Taken together, the above observations resulting from the techniques disclosed herein are useful to researchers and are consistent with the Boolean function of FIG. 6C. The absence of NCF4, if accompanied by other unknown factors, permits activation of the NADPH-oxidase, which could be aberrant, i.e., not necessarily responding to the presence of invasive microorganisms. If this happens, then the resulting oxidative burst, evidenced by PGM1 downregulation, is damaging to the cell, and is normally accompanied by triggering apoptotic mechanisms, which, however, are inhibited by the activated HGF resulting from the presence of hepsin. The damaged surviving cell may then become cancerous as a result of additional mutations.

Such an interpretation may be partially true or even not true at all. However, its credibility can be strengthened if a similar theme is encountered in other gene modules in accordance with the techniques disclosed herein. For example, as noted above, the same conditional entropy (0.1159) with the same Boolean function results if gene NCF4 is replaced with gene KRT6E (keratin 6E). It is known that mutations in keratin genes can prime cells to oxidative injury. In that case, KRT6E is absent due to its mutation, and the resulting oxidative injury does not stem from NADPH-oxidation, but is still manifested by the absence of PGM1, and the apoptotic mechanisms are still inhibited by the presence of hepsin.

There are many more gene modules that can be revealed by EMBP analysis in addition to those indicated in FIG. 6. Any variety and number of gene modules can be used that have low entropy. The analysis can be repeated for various different values of the threshold parameter, which correspond to different minimum microarray measurement levels that would be considered to be indicative of a gene being turned on. The results obtained for each threshold value describe a slightly different aspect of the underlying biological process leading to cancer. The sum of these results would then allow biologists to piece together elements of the biological pathway along with their interrelationships.

A notable feature of the EMBP method of the disclosed subject matter is that it can be systems-based, in the sense that it considers the synergistic contributions of sets of genes, rather than individual genes. As a result, the optimal gene module of size n−1 may not be a subset of the optimal gene module of size n, because the n members of the latter module may interact synergistically towards predicting disease in a manner that cannot be achieved if any one of the n members is removed.
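The following Python sketch is offered as one possible illustration of this point and is not a definitive implementation of EMBP. It computes the conditional entropy of the outcome given the joint binarized state of a candidate module and searches all modules of a given size exhaustively. The function names conditional_entropy and best_module, the gene names g1, g2 and g3, and the eight example tissues are hypothetical; the outcome is constructed so that it depends on g1 and g2 only through their joint state.

```python
from collections import Counter
from itertools import combinations
from math import log2
from typing import Dict, List, Sequence, Tuple

Sample = Dict[str, int]  # gene name -> binarized state (0 = absent, 1 = present)


def conditional_entropy(
    samples: Sequence[Sample], labels: Sequence[int], genes: Sequence[str]
) -> float:
    """H(label | joint state of genes) in bits, from empirical frequencies."""
    n = len(samples)
    joint: Counter = Counter()
    state_counts: Counter = Counter()
    for sample, label in zip(samples, labels):
        state = tuple(sample[g] for g in genes)
        joint[(state, label)] += 1
        state_counts[state] += 1
    h = 0.0
    for (state, _label), count in joint.items():
        h -= (count / n) * log2(count / state_counts[state])
    return h


def best_module(
    samples: Sequence[Sample], labels: Sequence[int], size: int
) -> Tuple[Tuple[str, ...], float]:
    """Exhaustively find the module of the given size with minimal entropy."""
    gene_names = sorted(samples[0])
    scored = (
        (conditional_entropy(samples, labels, module), module)
        for module in combinations(gene_names, size)
    )
    entropy, module = min(scored)
    return module, entropy


# Hypothetical tissues: columns are g1, g2, g3, label (1 = disease).
ROWS: List[Tuple[int, int, int, int]] = [
    (0, 0, 0, 0), (0, 0, 0, 0),
    (0, 1, 1, 1), (0, 1, 1, 1),
    (1, 0, 1, 1), (1, 0, 0, 1),
    (1, 1, 0, 0), (1, 1, 1, 0),
]
SAMPLES = [{"g1": r[0], "g2": r[1], "g3": r[2]} for r in ROWS]
LABELS = [r[3] for r in ROWS]

single, h_single = best_module(SAMPLES, LABELS, 1)
pair, h_pair = best_module(SAMPLES, LABELS, 2)
print(single, round(h_single, 4))
print(pair, round(h_pair, 4))
```

In this constructed example the best single gene is g3, whereas the best pair is {g1, g2}, which does not contain g3. Neither g1 nor g2 is informative alone, yet their joint state determines the outcome, which is the kind of synergy that individual gene ranking does not capture.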

The EMBP analysis of the disclosed subject matter provides an opportunity for fruitful cross-disciplinary collaboration, in which biologists use the “clues” resulting from the computational results to infer potential pathways, which they can validate with genetic experiments, as well as suggest further computational experiments. For example, if it is desired to identify which genes play synergistic roles with another particular gene in terms of causing disease, the presence of that gene can be “frozen” and the other genes in a module minimizing the entropy can be identified. Furthermore, the approach of the disclosed subject matter can immediately suggest to researchers novel potential therapeutic methods that would not be possible with traditional individual-gene approaches. For example, two genes that appear in the same Boolean function can be targeted by combining two already existing drugs targeting each of the genes.
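One way the “frozen” gene search might be sketched in Python is shown below. The sketch assumes that freezing a gene means requiring it to be a member of every candidate module and then ranking the possible partner genes by the conditional entropy of the resulting module; other readings are possible. The function names, the gene names frozen, p and q, and the example tissues are hypothetical.

```python
from collections import Counter
from itertools import combinations, product
from math import log2
from typing import Dict, List, Sequence, Tuple

Sample = Dict[str, int]  # gene name -> binarized state (0 = absent, 1 = present)


def conditional_entropy(
    samples: Sequence[Sample], labels: Sequence[int], genes: Sequence[str]
) -> float:
    """H(label | joint state of genes) in bits, from empirical frequencies."""
    n = len(samples)
    joint: Counter = Counter()
    state_counts: Counter = Counter()
    for sample, label in zip(samples, labels):
        state = tuple(sample[g] for g in genes)
        joint[(state, label)] += 1
        state_counts[state] += 1
    return -sum(
        (count / n) * log2(count / state_counts[state])
        for (state, _label), count in joint.items()
    )


def rank_partners(
    samples: Sequence[Sample],
    labels: Sequence[int],
    frozen_gene: str,
    n_partners: int,
) -> List[Tuple[float, Tuple[str, ...]]]:
    """Rank partner sets for a gene that is forced into every module.

    Returns (entropy, partners) pairs sorted with the lowest entropy first.
    """
    candidates = sorted(g for g in samples[0] if g != frozen_gene)
    scored = [
        (conditional_entropy(samples, labels, (frozen_gene,) + partners), partners)
        for partners in combinations(candidates, n_partners)
    ]
    return sorted(scored)


# Hypothetical tissues: every combination of three binary genes, with disease
# present only when both 'frozen' and 'p' are present; 'q' is uninformative.
SAMPLES = [
    {"frozen": f, "p": p, "q": q} for f, p, q in product((0, 1), repeat=3)
]
LABELS = [s["frozen"] & s["p"] for s in SAMPLES]

ranking = rank_partners(SAMPLES, LABELS, "frozen", 1)
print(ranking)
```

In this constructed example, gene p ranks first because its joint state with the frozen gene fully determines the outcome, while gene q leaves half a bit of uncertainty.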

The EMBP analysis of the disclosed subject matter is a significant new tool for medical researchers working synergistically with future efforts of diseased tissue genome sequencing. For example, a Boolean function such as a′bc would suggest to a researcher the possibility that gene a may be inactivated due to its mutation or to hypermethylation of its promoter, as previously discussed regarding KRT6E and GSTP1, respectively. This observation may provide motivation to the researcher to sequence gene a in diseased tissues.

When EMBP analysis is attempted in datasets in which one of the two classification tissue sets had about 20 samples, there may be a finding of zero entropy with only one or two genes, which would not provide new information compared to the traditional individual gene ranking approaches. However, if several hundred tissues are used in each classification set, then the gene modules, in accordance with the present subject matter, will contain a large number of genes, and the resulting Boolean functions, derived after running on high-end processors (including Boolean parsimony with heuristic methods), will accurately provide clues to researchers for inferring pathways.

The foregoing merely illustrates the principles of the disclosed subject matter. Various modifications and alterations to the described embodiments will be apparent to those skilled in the art in view of the teachings herein. It will thus be appreciated that those skilled in the art will be able to devise numerous techniques which, although not explicitly described herein, embody the principles of the disclosed subject matter and are thus within the spirit and scope of the disclosed subject matter.

1. A method for predicting an effect of a drug on a subject by selecting factors from a data set of measurements, the measurements including values of the factors and outcomes, comprising: identifying two or more factors that are jointly associated with one or more outcomes from the data set, wherein the two or more factors include two or more genes and the one or more outcomes include the effect of the drug; analyzing each of the two or more factors to determine at least one interaction therein with respect to an outcome; identifying interactions, if any, from the determined at least one interaction that correlate the data from the two or more factors with the effect of the drug; and predicting the effect of the drug on a subject by assessing the presence or absence of the interactions correlated with the effect of the drug on the subject.
2. The method of claim 1, wherein the data includes gene expression data comprising expression levels for each of the two or more genes.
3. The method of claim 1, wherein the data includes the presence or absence of single nucleotide polymorphisms (SNPs) in each of the two or more genes.
4. The method of claim 1, wherein the two or more factors comprise a module of factors.
5. The method of claim 4, wherein the at least one interaction comprises a structure of interactions.
6. The method of claim 4, wherein the at least one interaction comprises a logic function.
7. The method of claim 6, wherein the two or more genes comprise a module of genes.
8. The method of claim 7, wherein the module of genes comprises a smallest module of genes with joint expression levels that can be used for the prediction of the effect of the drug on the subject with high accuracy.
9. The method of claim 8, wherein the logic function comprises a simplest logic function connecting the genes to achieve the prediction.
10. A method for predicting an effect of a drug on a subject by selecting two or more genes from gene data, comprising: discretizing the gene data; identifying two or more genes from the selected two or more genes having a minimal conditional entropy; identifying an interaction, if any, that correlates the gene data for the two or more identified genes with the effect of the drug; and predicting the effect of the drug on the subject by assessing the presence or absence of the interaction correlated with the effect of the drug on the subject.
11. The method of claim 10, wherein the gene data includes gene expression data comprising expression levels for each of the two or more genes.
12. The method of claim 10, wherein the gene data includes the presence or absence of SNPs in each of the two or more genes.
13. The method of claim 10, wherein the gene expression data is derived from at least one microarray of gene expression data.
14. The method of claim 10, wherein the two or more genes comprise a module of genes.
15. The method of claim 10, wherein the interaction is modeled using a most parsimonious Boolean function.
16. A system for predicting an effect of a drug on a subject by selecting two or more genes from gene data, comprising: at least one processor, and a computer readable medium coupled to the at least one processor, having stored thereon instructions which when executed cause the processor to: discretize the gene data; choose a single threshold for each of the two or more genes; identify the two or more genes from the selected two or more genes having a minimal conditional entropy; identify an interaction, if any, that correlates the gene data for the two or more genes with the effect of the drug; and predict the effect of the drug on the subject by assessing the presence or absence of the interaction correlated with the effect of the drug on the subject.
17. The system of claim 16, wherein the gene data includes gene expression data comprising expression levels for each of the two or more genes.
18. The system of claim 16, wherein the gene data includes the presence or absence of SNPs in each of the two or more genes.
19. The system of claim 16, wherein the gene expression data is derived from a microarray of gene expression data.
20. The system of claim 16, wherein the two or more genes comprise a module of genes.
21. The system of claim 16, wherein the interaction comprises a most parsimonious Boolean function.
22. A system for predicting an effect of a drug on a subject by selecting factors from a data set of measurements, each measurement comprising values of the factors and outcomes, comprising: at least one processor, and a computer readable medium coupled to the at least one processor, having stored thereon instructions which when executed cause the at least one processor to: identify two or more factors that are jointly associated with one or more outcomes from the data, wherein the two or more factors include two or more genes and the one or more outcomes comprise the effect of the drug; analyze each of the two or more factors to determine at least one interaction therein with respect to an outcome; identify interactions, if any, from the determined at least one interaction that correlate the data from the two or more factors with the effect of the drug; and predict the effect of the drug on the subject by assessing the presence or absence of the interactions correlated with the effect of the drug on the subject.
23. The system of claim 22, wherein the data includes gene expression data comprising expression levels for each of the two or more genes.
24. The system of claim 22, wherein the data includes the presence or absence of SNPs in each of the two or more genes.
25. The system of claim 22, wherein the two or more factors comprise a module of factors.
26. The system of claim 25, wherein the at least one interaction comprises a structure of interactions.
27. The system of claim 25, wherein the at least one interaction comprises a logic function.
28. The system of claim 22, wherein the two or more genes comprise a module of genes.
29. The system of claim 28, wherein the module of genes comprises a smallest module of genes with joint expression levels that can be used for the prediction of the effect of the drug with high accuracy.
30. The system of claim 29, wherein the logic function comprises the simplest logic function connecting the genes to achieve the prediction.